ABSTRACT

In social science research there are a number of instruments 
that use a rating scale such as a Likert response scale. For a number of 
reasons, a respondent's response vector may not contain responses to each 
item. This study investigated the effect on a respondent's location estimate 
when a respondent is presented an item, has ample time to answer the item, 
but decides not to respond to the item. For these situations, different 
strategies have been developed for handling missing data. In this study, four 
different approaches for handling missing data were investigated for their 
capability to mitigate the effect of omitted responses on person location 
estimation. These methods included ignoring the omitted response, selecting 
the "midpoint" response category, Hot-decking, and a likelihood-based 
approach. A Monte Carlo study was performed and the effect of different 
levels of omission on the simulee's location estimates was determined. 

Results show that the Hot-decking procedure performed the best of the methods 
examined. Implications for practitioners were discussed. (Contains 6 figures 
and 10 references.) (Author/SLD) 
The effect of missing data on estimating a respondent's location using ratings data

In social science research there are a number of instruments that utilize a rating scale such as a
Likert response scale. For a number of reasons a respondent's response vector may not contain responses
to each item. Using Little and Rubin's (1987) terminology, nonresponses that arise from an a priori
decision not to administer certain items (e.g., in the adaptive administration of an instrument or when
respondents are directed to answer only relevant items; see Schulz & Sun, 2001) represent conditions in
which the missingness process may be ignored for purposes of estimating the person's location on the
latent continuum of interest (Mislevy & Wu, 1988, 1996). In contrast, nonresponses for
"not-reached" item(s) occur because a respondent has insufficient time to even consider responding to
the item(s). Assuming the respondent answers the items in serial order, these not-reached items can be
identified as collectively occurring at the end of an instrument. Another source of missing data occurs
because respondents have the capability of choosing not to respond to certain items on an instrument.
These (intentionally) omitted responses represent nonignorable missing data (Lord, 1980; Mislevy &
Wu, 1988, 1996). This latter condition is referred to as missing not at random (MNAR).
This study investigated the effect on a respondent's location estimate when a respondent is presented
an item, has ample time to answer the item, but decides not to respond to the item (i.e., the MNAR
case).

Different strategies have been developed for handling missing data. For example, respondents
with missing data may be dropped so that one performs a complete-case analysis (Groves, Dillman,
Eltinge, & Little, 2002). Alternatively, one may replace the missing values by 'estimates' to produce
'complete data' and these are then analyzed by standard methods. The replacement of the missing
values by estimates is known as imputation. A commonly used approach replaces the missing value
with the mean of the variable (i.e., mean substitution). A second strategy is hot-deck imputation
(Hanson, 1978, cited in Groves, Dillman, Eltinge, & Little, 2002). Hot-decking is based on matching the
respondent with the omitted response(s) to another individual based on variables (e.g., items) that are
observed for both persons (if there are multiple matching candidates, then an individual is selected at
random). The omitted responses are replaced with the responses from the matched individual. Both of
these strategies are considered to be single imputation methods. A single imputation method is an
approach where each missing value is replaced by a plausible value and then the 'complete' data are
analyzed (Sinharay, Stern, & Russell, 2001). While specific single imputation methods may have
specific disadvantage(s), a general disadvantage of single imputation methods is that they cannot
represent all of the uncertainty about which value to impute (Groves, Dillman, Eltinge, & Little, 2002).
Multiple imputation has been developed to address this disadvantage.
In multiple imputation a set of M datasets is created, each containing a different set of imputations of
the missing values (Groves, Dillman, Eltinge, & Little, 2002). Each of these M datasets is analyzed and the
results across the M analyses are combined to produce an estimate plus an assessment of its variability.
In contrast to imputation methods, maximum likelihood (ML) utilizes a stochastic model and makes
inferences based on the likelihood function of the incomplete data (Groves, Dillman, Eltinge, & Little,
2002). When data are missing at random the likelihood approach yields valid inferences about the
relevant parameters (Groves, Dillman, Eltinge, & Little, 2002; Sinharay, Stern, & Russell, 2001).
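
As an illustration of how results from the M analyses can be combined, the following is a minimal sketch that assumes the standard pooling rules usually attributed to Rubin; the text above does not name a specific combining rule, and multiple imputation was not one of the methods examined in this study. The function name `pool_estimates` and the numbers are hypothetical.

```python
import numpy as np

def pool_estimates(estimates, variances):
    """Combine M point estimates and their sampling variances (Rubin's rules)."""
    q = np.asarray(estimates, dtype=float)
    u = np.asarray(variances, dtype=float)
    m = len(q)
    q_bar = q.mean()            # pooled point estimate
    within = u.mean()           # average within-imputation variance
    between = q.var(ddof=1)     # between-imputation variance
    total = within + (1 + 1 / m) * between
    return q_bar, total

# Hypothetical results from M = 5 imputed datasets (made-up numbers).
q_bar, total_var = pool_estimates(
    estimates=[0.52, 0.48, 0.55, 0.50, 0.45],
    variances=[0.040, 0.042, 0.039, 0.041, 0.040],
)
print(q_bar, total_var)
```

The total variance exceeds the average within-imputation variance, which is how the additional uncertainty about the imputed values is carried into the final estimate.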

This study was concerned with the accuracy of person location estimates when respondents choose
not to answer one or more items on an instrument that uses rating scales. The responses to affective or
attitudinal instruments that use a rating scale, such as a Likert response scale, may be modeled using
Andrich's (1978a, 1978b) rating scale model (RSM). The RSM states that the probability of responding
in category $x_i$ of an $(m+1)$-category item $i$ can be obtained by

$$P(x_i \mid \theta) = \frac{\exp\left[\sum_{j=0}^{x_i} \left(\theta - (b_i + \tau_j)\right)\right]}{\sum_{k=0}^{m} \exp\left[\sum_{j=0}^{k} \left(\theta - (b_i + \tau_j)\right)\right]}, \qquad (1)$$

where $\theta$ is the person location on the latent continuum being measured by the instrument and $b_i$ is the
item's location on the same continuum. $\tau_j$ represents the threshold, or the location of the transition
from one response category to the next; $\tau_0 = 0$. Therefore, there are $m$ $\tau$s estimated for the $m+1$ response
categories across all items.
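
The category probabilities of the RSM can be sketched in a few lines. This is an illustrative implementation, not the software used in the study; the function name `rsm_probs` is hypothetical, the item location of 0 is arbitrary, and the three thresholds echo the values reported in the Results section.

```python
import numpy as np

def rsm_probs(theta, b, taus):
    """Category probabilities P(x | theta) for x = 0..m under the RSM.

    theta : person location (logits)
    b     : item location (logits)
    taus  : thresholds tau_1..tau_m (tau_0 = 0 is prepended here)
    """
    tau = np.concatenate(([0.0], np.asarray(taus, dtype=float)))
    # Running sums of (theta - (b + tau_j)) for j = 0..x give the exponents.
    exponents = np.cumsum(theta - (b + tau))
    exponents -= exponents.max()  # numerical guard; cancels in the ratio
    numer = np.exp(exponents)
    return numer / numer.sum()

p = rsm_probs(theta=0.5, b=0.0, taus=[-0.30, -0.02, 0.31])
print(p, p.sum())
```

Because the person in this example is located above the item and all of its thresholds, the probabilities increase from the lowest to the highest category.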

The RSM is a member of the Rasch family and, as such, the RSM assumes items are equally
effective at discriminating among examinees. Moreover, the unweighted sum of the respondent's
responses (scale score) is a sufficient statistic for estimating the respondent's location. Therefore,
because omitting responses affects the individual's scale score, it might be expected that estimation
of the person's location will be adversely affected.

Four different approaches for handling missing data were investigated for their capability to
mitigate the effect of omitted responses on person location estimation. These methods included
Ignoring the omitted response, selecting the "midpoint" response category, Hot-decking, and a
Likelihood-based approach.

Ignoring the omitted response had the effect of reducing the number of items used for estimating the 
person's location and thereby affecting the respondent's sufficient statistic for location estimation. This 
strategy of ignoring nonignorable missing data assumes that the omissions do not contain any useful 
information for estimating the respondent's location. 

Replacing the omitted response with the "midpoint" response category (in effect, assuming the 
response is neutral-like) does not reduce the number of items used in calculating the sufficient statistic. 
However, to the extent that this 'neutral' response is not reflective of the respondent's true response 
(e.g., strongly disagreeing with an item) this approach may introduce additional measurement error. 

The Hot-decking strategy selects a respondent (say, B) who is most similar to the respondent with
the missing response(s) (say, A) in terms of response string, but who has also answered the item
that respondent A did not respond to. Respondent B's response to the item in question is used for
respondent A's response to the item.
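
The matching step can be sketched as follows. The text does not state how similarity between response strings was measured, so this sketch assumes the mean absolute difference over items both respondents answered and breaks ties at random; the function name `hot_deck_impute` and the small data set are hypothetical.

```python
import numpy as np

def hot_deck_impute(data, seed=0):
    """Fill NaN entries of a persons-by-items array from the most similar donor."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    out = data.copy()
    for a in range(data.shape[0]):
        for item in np.flatnonzero(np.isnan(data[a])):
            best, donors = np.inf, []
            for d in range(data.shape[0]):
                # A donor must be another person who answered this item.
                if d == a or np.isnan(data[d, item]):
                    continue
                shared = ~np.isnan(data[a]) & ~np.isnan(data[d])
                if not shared.any():
                    continue
                # Assumed similarity measure: mean absolute difference.
                dist = np.abs(data[a, shared] - data[d, shared]).mean()
                if dist < best - 1e-12:
                    best, donors = dist, [d]
                elif abs(dist - best) <= 1e-12:
                    donors.append(d)
            if donors:
                out[a, item] = data[rng.choice(donors), item]
    return out

responses = np.array([
    [1, 2, np.nan, 4],   # respondent A omitted the third item
    [1, 2, 3, 4],        # respondent B matches A on every shared item
    [4, 4, 1, 1],
])
imputed = hot_deck_impute(responses)
print(imputed[0])
```

Here respondent B is the closest match, so B's response of 3 is used for A's omitted item.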

In the Likelihood approach the various possible responses are substituted for each omitted response
and the likelihood of that response pattern is calculated conditional on the location estimate, $\hat{\theta}$,
corresponding to the response vector's sufficient statistic. For instance, let us say that the respondent
has omitted one item and there are four possible response options (1, 2, 3, 4). In this approach the
omitted response would be replaced by a response of 1 and the likelihood based on the corresponding
sufficient statistic's $\hat{\theta}$ calculated. Then the omitted response would be replaced by a response of 2 and the
likelihood recalculated, and so forth for responses of 3 and 4. The $\hat{\theta}$ associated with the largest of the
four likelihoods was taken as the respondent's $\hat{\theta}$. Obviously, as the number of omissions increases the number of
combinations of potential responses also increases. This strategy attempts to determine what the most
likely responses should be. It is assumed that if a respondent feels that an item is not
applicable to him or her, he or she is presented the opportunity to select a 'not applicable'
category and, therefore, omissions are a function of a desire not to answer a particular question (e.g., the
question is of a sensitive nature).
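
A minimal sketch of this search is given below. It is an illustration under stated assumptions rather than the study's implementation: categories are coded 0 to m (one less than the 1 to 4 codes used above), omitted items are coded -1, the location estimate is found by a simple grid search over [-6, 6] instead of an iterative MLE routine, and all function names and the example response vector are hypothetical.

```python
import itertools
import numpy as np

def rsm_log_probs(theta, b, taus):
    """Items-by-categories matrix of log P(x | theta) under the RSM."""
    tau = np.concatenate(([0.0], np.asarray(taus, dtype=float)))
    expo = np.cumsum(theta - (np.asarray(b, dtype=float)[:, None] + tau), axis=1)
    expo = expo - expo.max(axis=1, keepdims=True)
    return expo - np.log(np.exp(expo).sum(axis=1, keepdims=True))

def log_lik(theta, x, b, taus):
    lp = rsm_log_probs(theta, b, taus)
    return lp[np.arange(len(x)), x].sum()

def mle_theta(x, b, taus, grid=np.linspace(-6, 6, 1201)):
    """Grid-search stand-in for the MLE of theta (an assumption of this sketch)."""
    ll = [log_lik(t, x, b, taus) for t in grid]
    return grid[int(np.argmax(ll))]

def likelihood_impute(x, b, taus):
    """Try every combination of responses for the omitted items (coded -1)."""
    x = np.asarray(x)
    missing = np.flatnonzero(x < 0)
    best = (-np.inf, None, None)
    for combo in itertools.product(range(len(taus) + 1), repeat=len(missing)):
        filled = x.copy()
        filled[missing] = combo
        theta_hat = mle_theta(filled, b, taus)
        ll = log_lik(theta_hat, filled, b, taus)
        if ll > best[0]:
            best = (ll, filled, theta_hat)
    return best[1], best[2]

b = np.linspace(-0.98, 0.86, 15)          # hypothetical item locations
taus = np.array([-0.30, -0.02, 0.31])
x = np.array([2, 3, 2, -1, 1, 2, 3, 2, 2, 1, 3, 2, 2, 3, 2])
filled, theta_hat = likelihood_impute(x, b, taus)
print(filled[3], theta_hat)
```

With k omitted items and four categories the loop evaluates 4^k candidate vectors, which is the growth in combinations noted above.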



Method 

Data Generation:

The simulation data were modeled on an empirical data set. This empirical data set consisted of
4282 respondents. These data consisted of responses to a questionnaire concerning sexual behavior that
was administered as part of an HIV Counseling and Testing program. Fifteen four-point Likert scale
(1 = strongly disagree to 4 = strongly agree) questions concerning opinions about condom use formed the
scale of interest. Because a respondent may omit an item as a function of many different factors (e.g.,
discomfort with the question) and there were no explicit measures of these factors, it was
decided not to use a parametric approach for modeling the empirical data. Because the omission
pattern across the scale scores differed for persons who responded in one category (e.g., strongly agree)
versus another response category on an item, four contingency tables were created for each item using the
4282 respondents. Each contingency table consisted of a two-level response type variable versus the
scale score variable. The two-level response type variable reflected omission and one of the response
alternatives. For example, for one table the response type variable consisted of response omission and
responding strongly disagree, for a second table the response type variable consisted of response
omission and responding disagree, etc. The scale score was transformed into deciles. Based on these
tables the proportion of individuals omitting a response to an item conditional on the fractile was
calculated. Some tables had cells with zero frequencies. In these cases, a value of 0.5 was substituted
for the zero frequency before calculating the proportions (i.e., resulting 'frequency' was 0.005).

The generation of the simulation data required item parameter estimates. Using only individuals 
that had complete response strings (N = 3473), BIGSTEPS (Linacre & Wright, 2001) was used to obtain 
item parameter estimates for the RSM. 
The simulated data were generated on the basis of the RSM, and the item parameter estimates from
the empirical data were treated as known. For each 0.1 logit from -2.0 to 2.0 (inclusive) 1000 $\theta$s were
generated for a total of 41,000 simulees. For each simulee the probability of a response in each category
was calculated according to the RSM. These probabilities were then accumulated across response
categories and compared to a uniform random number on [0, 1]. The first category whose cumulative
probability was greater than or equal to the random number determined the response for the item
(i.e., that category's ordinal position). To generate the omission data, the scale score for each simulee's response vector
was determined and the simulee was assigned to one of the ten fractiles. For each item the simulee's
response was used to determine which of the four contingency tables for the item should be used. Based
on the simulee's fractile assignment the appropriate relative frequency of omission was compared to a
uniform random number on [0, 1]. If the uniform random number was less than or equal to the relative
frequency for omission, conditional on the simulee's fractile, then the response was changed to be an
omission; otherwise the simulee's response to the item was not changed. For example, for an item the
relative frequencies of omission for a respondent in the third fractile might be 0.40, 0.30, 0.20, and 0.10 for
the strongly disagree, disagree, agree, and strongly agree categories, respectively. If the simulee's
generated response to the item was strongly disagree, then a uniform random number would be generated
and compared to 0.40. If this random number was, for instance, 0.3, then the simulee's response to this
item would be changed to reflect that it had been omitted. This process was repeated for each of the 15
items and for all simulees. Therefore, each simulee had a complete response vector and a response
vector containing omitted responses (a.k.a. the omission vector).
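
The two-stage generation can be sketched as below. The empirical omission tables and decile cut points are not reproduced in this paper, so the sketch assumes a made-up `omit_rate` array (items by fractiles by categories) and equal-width score bins in place of empirical deciles; item locations and all function names are likewise hypothetical.

```python
import numpy as np

def rsm_probs(theta, b, taus):
    """Items-by-categories matrix of RSM category probabilities."""
    tau = np.concatenate(([0.0], np.asarray(taus, dtype=float)))
    expo = np.cumsum(theta - (np.asarray(b, dtype=float)[:, None] + tau), axis=1)
    expo -= expo.max(axis=1, keepdims=True)
    p = np.exp(expo)
    return p / p.sum(axis=1, keepdims=True)

def simulate_responses(theta, b, taus, rng):
    """Draw one response per item, coded 1..m+1, from cumulative probabilities."""
    cum = np.cumsum(rsm_probs(theta, b, taus), axis=1)
    u = rng.uniform(size=(len(b), 1))
    # Index of the first category whose cumulative probability is >= u.
    first = (u > cum).sum(axis=1)
    return np.minimum(first, len(taus)) + 1

def apply_omissions(resp, omit_rate, rng):
    """Turn responses into omissions (NaN) using conditional omission rates."""
    n_items, n_fractiles, n_cats = omit_rate.shape
    score = resp.sum()
    lo, hi = n_items, n_items * n_cats
    # Assumption: equal-width score bins stand in for the empirical deciles.
    fractile = min(int(n_fractiles * (score - lo) / (hi - lo + 1)), n_fractiles - 1)
    rates = omit_rate[np.arange(n_items), fractile, resp - 1]
    u = rng.uniform(size=n_items)
    out = resp.astype(float)
    out[u <= rates] = np.nan
    return out

rng = np.random.default_rng(2024)
b = np.linspace(-0.98, 0.86, 15)                 # hypothetical item locations
taus = [-0.30, -0.02, 0.31]
omit_rate = np.full((15, 10, 4), 0.05)           # made-up omission rates
omit_rate[0, 2, :] = [0.40, 0.30, 0.20, 0.10]    # the worked example above

complete = simulate_responses(0.0, b, taus, rng)
omission = apply_omissions(complete, omit_rate, rng)
print(complete, omission)
```

Each simulee therefore ends up with a complete vector and an omission vector that differ only where a response was switched to missing.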

Ability Estimation: For each simulee, a $\hat{\theta}$ based on the complete response vector and another based on
the omission vector were obtained using maximum likelihood estimation (MLE) with the RSM. For the
omission vector the various imputation methods described above were used to impute the missing
response and then MLE was used for location estimation. Therefore, each simulee had a $\theta$, a $\hat{\theta}$ based on
the complete response vector, a $\hat{\theta}$ based on Ignoring the omitted response(s), a $\hat{\theta}$ based on using the
Midpoint category for the omitted response(s), a $\hat{\theta}$ based on Hot-decking, and a $\hat{\theta}$ based on using the
Likelihood strategy.
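
For readers who want to see the estimation step concretely, the following is a minimal Newton-Raphson sketch of MLE for the person location under the RSM with item parameters treated as known. It is not the routine used in the study. Categories are coded 0 to m, omitted items are coded NaN and simply skipped (which corresponds to the Ignore strategy), and the item locations and function names are hypothetical.

```python
import numpy as np

def rsm_probs(theta, b, taus):
    """Items-by-categories matrix of RSM category probabilities."""
    tau = np.concatenate(([0.0], np.asarray(taus, dtype=float)))
    expo = np.cumsum(theta - (np.asarray(b, dtype=float)[:, None] + tau), axis=1)
    expo -= expo.max(axis=1, keepdims=True)
    p = np.exp(expo)
    return p / p.sum(axis=1, keepdims=True)

def mle_theta(x, b, taus, tol=1e-8, max_iter=100):
    """Newton-Raphson MLE of theta; NaN responses are ignored."""
    x = np.asarray(x, dtype=float)
    b = np.asarray(b, dtype=float)
    keep = ~np.isnan(x)
    x, b = x[keep], b[keep]
    m = len(taus)
    r = x.sum()                                  # scale score on answered items
    if r == 0 or r == m * len(x):
        raise ValueError("no finite MLE for zero or perfect scores")
    cats = np.arange(m + 1)
    theta = 0.0
    for _ in range(max_iter):
        p = rsm_probs(theta, b, taus)
        mean = p @ cats                          # expected item scores
        var = p @ cats**2 - mean**2              # item score variances
        step = (r - mean.sum()) / var.sum()
        theta += float(np.clip(step, -1.0, 1.0))
        if abs(step) < tol:
            return theta
    raise RuntimeError("estimation did not converge")

b = np.linspace(-0.98, 0.86, 15)
taus = [-0.30, -0.02, 0.31]
x1 = np.array([3, 3, 2, 2, 1, 1, 0, 2, 3, 1, 2, 2, 1, 3, 2], dtype=float)
x2 = x1[::-1]                                    # same scale score, other pattern
t1, t2 = mle_theta(x1, b, taus), mle_theta(x2, b, taus)

x_omit = x1.copy()
x_omit[0] = np.nan                               # one omitted response
t_ignore = mle_theta(x_omit, b, taus)
print(t1, t2, t_ignore)
```

The two complete vectors share a scale score and so receive the same estimate, which reflects the sufficiency property of the scale score under the RSM; dropping an item changes both the score and the set of items entering the estimate.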
Analysis: Independent and Dependent Variables:

Each level of the imputation method factor was crossed with the number of items omitted in the response
vector ($N_{omitted}$). The $N_{omitted}$ factor consisted of three levels: 1, 2, and 3 omitted responses. These
three levels of $N_{omitted}$ represent approximately 7%, 13%, and 20% of the test length,
respectively. The dependent variables were the various location estimates.

To assess the effect of omission on the accuracy of the person location, the Root Mean Square Error of
the estimate (RMSE) and bias were calculated with respect to the simulees' known location. In
addition, RMSE and bias were calculated for the location estimate obtained using the complete response
data. Because of the way the locations were generated it was possible to investigate the effect of
omitted responses as a function of location as well as across the ability scale. RMSE and Bias were
calculated according to:

$$\mathrm{RMSE}(\hat{\theta}) = \sqrt{\frac{\sum_{i=1}^{n} (\hat{\theta}_i - \theta_k)^2}{n}} \qquad (2)$$

$$\mathrm{Bias}(\hat{\theta}) = \frac{\sum_{i=1}^{n} (\hat{\theta}_i - \theta_k)}{n} \qquad (3)$$

where $\hat{\theta}$: location estimate based on one of the estimation methods using either the
complete data ($\hat{\theta}_c$) or missing data ($\hat{\theta}_o$)
$\theta_k$: simulees' location at logit $k$ (-2.0, -1.9, -1.8, ..., 2.0)
$n$: the number of simulees at logit $k$

RMSE and Bias were calculated separately for the complete vectors and omission vectors. Because
RMSEs for the complete vectors represented how well the simulees' locations could be estimated on the
basis of complete response data, the RMSEs for the omission vectors were compared to the corresponding
RMSEs for the complete vectors; this was also true for Bias. These absolute differences between the
RMSE for the omission and complete vectors as well as the difference between Bias based on the omission
and complete vectors were examined graphically for each condition. Only $\theta$ points with at least 10
observations were plotted. All statistics were calculated using convergent cases with listwise deletion of
missing data.
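
The comparison at a single logit point can be sketched as follows, assuming the usual definitions of RMSE (square root of the mean squared difference between estimate and generating location) and bias (mean signed difference). The function name and the three estimates per condition are hypothetical; the study plotted only points with at least 10 observations.

```python
import numpy as np

def rmse_and_bias(theta_hat, theta_true):
    """RMSE and bias of estimates against a known generating location."""
    err = np.asarray(theta_hat, dtype=float) - theta_true
    err = err[~np.isnan(err)]          # drop nonconvergent cases
    return np.sqrt(np.mean(err**2)), np.mean(err)

theta_k = 1.0                           # generating location at this logit point
est_complete = [0.9, 1.1, 1.2]          # made-up estimates, complete vectors
est_omission = [0.6, 0.8, 1.0]          # made-up estimates, omission vectors

rmse_c, bias_c = rmse_and_bias(est_complete, theta_k)
rmse_o, bias_o = rmse_and_bias(est_omission, theta_k)

# Quantities examined graphically: omission minus complete at each logit.
rmse_diff = abs(rmse_o - rmse_c)
bias_diff = bias_o - bias_c
print(rmse_diff, bias_diff)
```

A negative bias difference, as in this made-up example, indicates that the estimates from the omission vectors fall below those from the complete vectors.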



Results 

The item locations were distributed between -0.98 and 0.86 with $\tau_1$ = -0.30, $\tau_2$ = -0.02, and $\tau_3$ = 0.31.
The numbers of simulees that omitted 1, 2, and 3 items were 7040, 3224, and 2994, respectively. For the
one-omit level the average trait value was $\bar{\theta}$ = 1.0421 (SD = 0.9721; Skew = -0.4402), for the two-omit
level $\bar{\theta}$ = -0.1911 (SD = 1.2181; Skew = -0.2872), and for the three-omit level $\bar{\theta}$ = -1.1971 (SD = 1.0550;
Skew = 0.1027). Table 1 contains descriptive statistics, the fidelity coefficients based on complete data
($r_{\theta\hat{\theta}_c}$) and that based on missing data ($r_{\theta\hat{\theta}_o}$), as well as the correlation between the location estimate
based on complete data and that based on missing data ($r_{\hat{\theta}_c\hat{\theta}_o}$). As would be expected, the $r_{\theta\hat{\theta}_o}$s for a given
level of $N_{omitted}$ were always less than the corresponding $r_{\theta\hat{\theta}_c}$ for that level of $N_{omitted}$. With
respect to missing data strategy, the $\hat{\theta}_o$s had the strongest linear relationship with the $\theta$s under the
Hot-decking approach for the 1 and 2 omit levels. For three omits the Midpoint strategy yielded the
largest fidelity coefficient. However, the difference between the largest and smallest fidelity
coefficients (i.e., $r_{\theta\hat{\theta}_o}$) across levels was always less than 0.02.



Insert Table 1 about here 



While the fidelity coefficients indicate the degree of linear agreement between two scales, they do
not indicate the accuracy of estimation. To assess the accuracy of location estimation RMSE($\hat{\theta}$) and
Bias($\hat{\theta}$) were calculated. Because the RMSE($\hat{\theta}$) based on complete response data indicates how well one
can expect to do with this item pool, the RMSE plots represent the difference between the RMSE($\hat{\theta}$)
based on the complete response data and that based on the response vectors with missing data.
These RMSE($\hat{\theta}$) differences as a function of $\theta$ for the various missing data strategies are presented in
Figures 1-3 for the one-, two-, and three-omit levels, respectively. From Figure 1 one sees that there
appears to be little difference between Ignoring omits, the Midpoint, and the Hot-decking methods for
the upper half of the $\theta$ continuum. Moreover, for this portion of the $\theta$ continuum the Likelihood method
did not function as well as the other strategies. In the lower half of the $\theta$ continuum the Midpoint, the
Hot-decking, and the Likelihood methods were very similar to one another. It appears that Hot-decking
performed the best for the greatest range of $\theta$ for the one-omit level.

Insert Figure 1 about here 



Figures 2 and 3 show that, despite the large fidelity coefficients for the Ignore Omit(s) missing data
strategy, this method does not yield accurate $\hat{\theta}$s for most of the $\theta$ scale for both the two- and three-omit
conditions. As was the case for the one-omit level, the Hot-decking procedure had, in general,
RMSE($\hat{\theta}$)s that agreed better with those based on complete data than did the other methods for $\theta$s
above -0.9 for both the two- and three-omit conditions. The Midpoint missing data strategy performed
almost as well as the Hot-decking procedure, with the Likelihood strategy performing worse than both
the Midpoint and Hot-decking procedures.

Insert Figures 2 and 3 about here 



Similar to RMSE($\hat{\theta}$), the Bias($\hat{\theta}$) based on the complete response data indicates how well one can
expect to do with this item pool; therefore the plots represent the difference between this value and
the Bias($\hat{\theta}$) based on the missing data response vectors. The Baseline represents perfect agreement
between the Bias($\hat{\theta}$) based on complete data and that using a missing data procedure for location
estimation. With respect to Bias($\hat{\theta}$) and the one-omit level (Figure 4) one finds that the Hot-decking
missing data procedure exhibited greater agreement with the bias based on complete data than did the
other methods. The Midpoint strategy also showed similar bias to that of the complete data results
between -1.0 and 1.0. Ignoring the missing data introduced substantially more underestimation bias than
was found with the complete data. The Likelihood approach tended to yield $\hat{\theta}$s that were larger than
those based on complete data throughout the $\theta$ continuum.

Insert Figure 4 about here 



For the two-omit level (Figure 5) Hot-decking performed similarly to what was observed with the
complete data and better than the other missing data procedures between approximately -1.0 and 1.5. As
was the case with the one-omit level, the Midpoint strategy performed almost as well as Hot-decking
in terms of agreeing with the Bias($\hat{\theta}$) based on complete response vectors. In general, the Likelihood
procedure performed similarly to the Midpoint and Hot-decking procedures between -0.5 and 0.2 and
better than Ignoring the omitted responses. However, across the $\theta$ continuum the Likelihood strategy
did not do as well as Hot-decking or using the midpoint value. As was the case with the one-omit level,
Ignoring the omits resulted in $\hat{\theta}$s that were smaller than those based on complete data.

Insert Figure 5 about here 



The results for the three-omit level (Figure 6) followed those of the two-omit level. Above $\theta$ = -1.0
the Bias($\hat{\theta}$) based on Hot-decking the missing data vectors agreed well with that from the complete
response vectors. The Midpoint strategy also performed reasonably well above $\theta$ = -0.3, with the
Likelihood approach performing similarly or slightly better as $\theta$ increased above -1.0. Compared to
the Bias($\hat{\theta}$) based on complete data, Ignoring the omitted responses performed substantially worse than
the other strategies for most of the continuum.

Insert Figure 6 about here 



Discussion 

Respondents choose to not answer certain questions for a variety of reasons. This study used a 
nonparametric approach with empirical data in order to provide realistic guidance for generating 
simulation data to ascertain the effect of omission on location estimation with the RSM. 

As would be expected, as the number of omissions increased the accuracy of $\hat{\theta}$ decreased for a given
missing data strategy. The above results seem to indicate that with the RSM omits should not be
ignored. This was particularly true for the two- and three-omit conditions. While Ignoring omits
yielded reasonably accurate $\hat{\theta}$s in terms of RMSE (see Figure 1), Figure 4 showed that these estimates
exhibited underestimation bias throughout the $\theta$ continuum. Of the imputation methods, Hot-decking
appeared to be, overall, the best strategy to use in terms of producing RMSE($\hat{\theta}$) and Bias($\hat{\theta}$) that agreed
with those obtained from complete data. It should be noted that given the logit range represented in the
item pool, the RMSE($\hat{\theta}$) and Bias($\hat{\theta}$) below -1.0 and above 1.0 may only be indicative, but not very
stable.

Although the Midpoint imputation method introduced measurement error, from an estimation
perspective using the Midpoint strategy produced reasonably good $\hat{\theta}$s for the one- and two-omit conditions.
This strategy did not perform as well in the three-omit condition as it had in the one- and two-omit
conditions. This may be due to the cumulative effect of the introduction of measurement error with each
imputation. For instance, comparatively speaking, imputing a neutral response for the first omit may
not be that deleterious in terms of the amount of measurement error introduced (i.e., whenever the
actual response was not neutral, imputing a neutral response introduced error). However, as one
imputes more neutral responses the sufficient statistic becomes more distorted due to the increased
measurement error. As a result, the accuracy of the corresponding $\hat{\theta}$ is degraded.

Theoretically, it was expected that the Likelihood approach would have performed better than it
did because it attempted to determine which was the most likely response pattern based on the current
complete information. However, this expectation was not realized. While under some conditions the
Likelihood strategy performed comparably to the Midpoint strategy (e.g., the two-omit level between
-0.7 and 0.2 as well as the three-omit condition between -0.25 and 0.25), the Likelihood strategy never
performed as well as Hot-decking across the $\theta$ continuum. In the Likelihood approach each omit was replaced by
one of the possible responses and the likelihood of the response vector recalculated based on the
response vector's $\hat{\theta}$. As is well known, certain response patterns (e.g., Guttman patterns) have a higher
likelihood of occurrence than other response patterns. It was this property that was expected to be
exploited by the Likelihood approach. These results seem to indicate that the most likely response
pattern did not correspond to the observed complete response pattern. This may be due to the stochastic
nature of the data.

The results indicate that when using the RSM and MLE for location estimation, Hot-decking may be 
the preferred approach to use with missing data. It should be noted that for this study the data were 
generated, in part, according to the RSM. To the extent that a non-Rasch family model more accurately
describes the data, one may not observe results comparable to those seen here.
Table 1: Descriptive Statistics and Fidelity Coefficients

Omits    Method      $r_{\theta\hat{\theta}_c}$  $r_{\theta\hat{\theta}_o}$  $r_{\hat{\theta}_c\hat{\theta}_o}$  $\bar{\hat{\theta}}_c$ (a)  $\bar{\hat{\theta}}_o$  $s_o$    Skew$_o$
1 omit   Ignore      0.9264                      0.9116                      0.9820                              1.0192                      0.7703                  0.7742   -1.4528
         Midpoint                                0.9175                      0.9867                                                          0.9641                  0.8151   -0.6997
         Hotdeck                                 0.9225                      0.9894                                                          0.9939                  0.8453   -0.6257
         Likelihood                              0.9095                      0.9820                                                          1.1709                  1.0153   -0.4355
2 omits  Ignore      0.9586                      0.9458                      0.9846                              -0.2240                     -0.4905                 1.1498   -0.8751
         Midpoint                                0.9555                      0.9807                                                          -0.1155                 0.9452   -0.1271
         Hotdeck                                 0.9563                      0.9777                                                          -0.1323                 1.0069   -0.1116
         Likelihood                              0.9469                      0.9826                                                          -0.1197                 1.3632   -0.2977
3 omits  Ignore      0.9237                      0.9083                      0.9697                              -1.2526                     -1.5731                 1.0808   -0.1191
         Midpoint                                0.9174                      0.9553                                                          -0.8162                 0.6512   0.5453
         Hotdeck                                 0.9014                      0.9313                                                          -0.9557                 0.7470   0.4638
         Likelihood                              0.8976                      0.9608                                                          -1.3282                 1.1907   0.0067

(a) 1 omit: $s_c$ = 0.9187, Skew$_c$ = -0.7492; 2 omits: $s_c$ = 1.2282, Skew$_c$ = -0.5589;
3 omits: $s_c$ = 1.0888, Skew$_c$ = -0.1077

Figure 1: One-Omit, RMSE
Figure 2: Two-Omits, RMSE
Figure 3: Three-Omits, RMSE
Figure 4: One-Omit, Bias
Figure 5: Two-Omits, Bias
Figure 6: Three-Omits, Bias